What is this?

This notebook contains a set of analyses of GOBbluth89’s boardgamegeek collection. The bulk of the analysis is focused on building a user-specific model that predicts which games the specified user is likely to own. This lets us ask questions like: based on the games the user currently owns, which games are a good fit for their collection? Which upcoming games are they likely to purchase?

1 The Data

1.1 Collection Overview

We can look at a basic description of the number of games that the user owns, has rated, has previously owned, etc.

1.2 Collection by Year Published

What years has the user owned/rated games from? While we can’t see when a user added or removed a game from their collection, we can look at their collection by the years in which their games were published.

1.3 What types of games does GOBbluth89 own?

We can look at the most frequent types of categories, mechanics, designers, and artists that appear in a user’s collection.

2 Modeling GOBbluth89’s Collection

We’ll examine predictive models trained on the user’s collection for games published before 2020. How many games has the user owned/rated/played in the training set (games published before 2020)?

The main outcome we will be modeling for the user is owned, which refers to whether the user currently owns or previously owned a game in their collection. Our goal is to train a predictive model to learn the probability that a user will add a game to their collection based on its observable features. This amounts to examining historical data to find patterns between the features of games and the games present in the user’s collection.
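As a minimal sketch of this framing (the games, features, and collection below are hypothetical, not GOBbluth89’s actual data), each game becomes a row of observable features paired with a 0/1 `owned` label:

```python
# Hypothetical games with a couple of observable features; the real
# analysis uses many more (categories, mechanics, designers, artists).
games = {
    "Gloomhaven":        {"year": 2017, "complexity": 3.9},
    "Azul":              {"year": 2017, "complexity": 1.8},
    "Twilight Struggle": {"year": 2005, "complexity": 3.6},
}

# Hypothetical collection: games the user owns or previously owned.
collection = {"Gloomhaven", "Twilight Struggle"}

# Build (features, label) pairs: owned = 1 if the game is in the collection.
rows = [(feats, 1 if name in collection else 0) for name, feats in games.items()]

labels = [label for _, label in rows]
print(labels)  # one 0/1 outcome per game
```

The model then learns a mapping from the feature side of each pair to the label side.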

2.1 Decision Tree for GOBbluth89

One of the models we trained was a decision tree, which looks for decision rules that can be used to separate games the user owns from games they don’t. The resulting model makes its decisions via a sequence of yes or no statements: to explain why the model predicts the user to own a game, we start at the top of the tree and follow the rules that were learned from the training data.

Note: the tree below has been further pruned to make it easier to visualize.

Decision trees are highly interpretable models that are easy to train and can identify important interactions and nonlinearities present in the data. Individual trees have the drawback of being less predictive than other common models, but it can be useful to look at them to gain some understanding of key predictors and relationships found in the training data.
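To illustrate how following a tree’s rules yields a prediction, here is a hand-written stand-in for a small pruned tree. The split features, thresholds, and leaf probabilities are invented for illustration; the actual model learns its splits from the training data.

```python
def tree_predict(game):
    """Follow yes/no rules from the root down to a leaf.

    Each branch is one learned rule; the leaf holds the predicted
    probability that the user owns the game.
    """
    if game["complexity"] > 2.5:      # root split: heavier games
        if game["year"] >= 2010:      # second split: recent publication
            return 0.8                # heavy, recent game -> likely owned
        return 0.45
    return 0.1                        # light games -> unlikely owned

print(tree_predict({"complexity": 3.9, "year": 2017}))  # 0.8
```

Reading off the path taken (complexity > 2.5, year >= 2010) is exactly the kind of explanation the tree diagram provides.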

2.2 Coefficients for GOBbluth89

We can examine coefficients from another model we trained, which is a logistic regression with elastic net regularization (which I will refer to as a penalized logistic regression). Positive values indicate that a feature increases a user’s probability of owning/rating a game, while negative values indicate a feature decreases the probability. To be precise, the coefficients indicate the effect of a particular feature on the log-odds of a user owning a game.
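A sketch of how log-odds coefficients translate into probabilities, with hypothetical coefficient values (the real penalized regression has many more features and different magnitudes):

```python
import math

# Hypothetical coefficients on the log-odds scale.
intercept = -2.0
coefs = {"mechanic_deck_building": 1.2, "category_party_game": -0.9}

def predict_proba(features):
    """Convert the linear predictor (log-odds) to a probability
    via the inverse logit function."""
    log_odds = intercept + sum(coefs[f] * x for f, x in features.items())
    return 1.0 / (1.0 + math.exp(-log_odds))

# A deck builder that is not a party game:
p = predict_proba({"mechanic_deck_building": 1, "category_party_game": 0})
print(round(p, 3))  # roughly 0.31
```

A positive coefficient (here, deck building) pushes the log-odds, and hence the probability, up; a negative coefficient pushes it down.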

2.3 Visualizing Predictors

Why did the model identify these features? We can make density plots of the important features for predicting whether the user owned a game. Blue indicates the density for games owned by the user, while grey indicates the density for games not owned by the user.

Binary predictors can be difficult to see with this visualization, so we can also directly examine the percentage of games in a user’s collection with a predictor vs the percentage of all games with that predictor.
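The comparison for a binary predictor reduces to two prevalences. A toy sketch with invented data, for a single hypothetical mechanic flag:

```python
# Hypothetical flags: does each game have the "Deck Building" mechanic?
all_games = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0}
collection = {"A", "C"}  # games the user owns

# Share of the collection with the mechanic vs share of all games.
pct_collection = sum(all_games[g] for g in collection) / len(collection)
pct_all = sum(all_games.values()) / len(all_games)

print(f"{pct_collection:.0%} of collection vs {pct_all:.0%} of all games")
```

A large gap between the two percentages is what makes the predictor useful to the model.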

3 Examine Model’s Performance on Training Set

Before predicting games in upcoming years, we can examine how well the model did and what games it liked in the training set. In this case, we used resampling techniques (cross validation) to ensure that the model had not seen a game before making its predictions.
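The cross-validation idea can be sketched as follows: split the games into folds so that each game’s prediction comes from a model fit without it. This is a minimal index-splitting illustration, not the notebook’s actual resampling code.

```python
def kfold_indices(n_games, k):
    """Assign game indices to k folds. Each game is held out exactly
    once, so its prediction comes from a model that never saw it."""
    return [list(range(i, n_games, k)) for i in range(k)]

folds = kfold_indices(10, 5)
print(folds)  # each of the 10 games appears in exactly one fold
```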

3.1 Top Games from Training Set

Displaying the 100 games from the training set with the highest probability of ownership, highlighting in blue games the user has owned.

3.2 Model Evaluation

This section contains a variety of visualizations and metrics for assessing the performance of the model(s) during resampling. If you’re not particularly interested in predictive modeling, skip down further to the predictions from the model.

3.2.1 Separation Plots

An easy way to examine the performance of a classification model is to view a separation plot. We plot the predicted probabilities from the model for every game (from resampling) from lowest to highest. We then overlay a blue line for any game that the user does own. A good classifier is one that is able to separate the blue (games owned by the user) from the white (games not owned by the user), with most of the blue occurring at the highest probabilities (right side of the chart).
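The construction behind the plot is just a sort. A text-rendered sketch with hypothetical (probability, owned) pairs, where `#` plays the role of the blue line:

```python
# Hypothetical (predicted probability, owned) pairs from resampling.
preds = [(0.9, 1), (0.2, 0), (0.7, 1), (0.1, 0), (0.6, 0), (0.8, 1)]

# Sort lowest to highest probability, as in the separation plot.
ordered = sorted(preds)

# Tiny text version: '#' = owned (the blue line), '.' = not owned.
plot = "".join("#" if owned else "." for _, owned in ordered)
print(plot)  # a good model clusters '#' on the right
```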

3.2.2 Area Under the Curve

We can more formally assess how well each model did in resampling by looking at the area under the receiver operating characteristic curve. A perfect model would receive a score of 1, while a model that cannot predict the outcome will default to a score of 0.5. What counts as a good score depends on the setting, but generally anything in the .8 to .9 range is very good, while the .7 to .8 range is perfectly acceptable.
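The AUC has a concrete interpretation: the probability that a randomly chosen owned game receives a higher predicted probability than a randomly chosen non-owned game. A pairwise-comparison sketch on hypothetical predictions:

```python
def auc(preds):
    """AUC = fraction of (owned, not-owned) pairs the model ranks
    correctly, counting ties as half credit."""
    pos = [p for p, y in preds if y == 1]
    neg = [p for p, y in preds if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

preds = [(0.9, 1), (0.2, 0), (0.7, 1), (0.1, 0), (0.6, 0), (0.8, 1)]
print(auc(preds))  # 1.0: every owned game outranks every non-owned game
```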

3.2.3 Lift and Gain Curve

Another way to think about the model performance is to view its lift, or its ability to detect the positive outcomes over that of a null model. High lift indicates the model can much more quickly find all of the positive outcomes (in this case, games owned or played by the user), while a model with no lift is no better than random guessing. A gains chart is another way to view this.
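Lift at a given depth can be computed directly: the hit rate among the model’s top-ranked games divided by the base rate (a null model’s hit rate). A sketch with hypothetical predictions:

```python
def lift_at(preds, frac):
    """Lift: ownership rate among the top `frac` of predictions
    divided by the overall ownership rate (random guessing)."""
    ranked = sorted(preds, reverse=True)  # highest probability first
    top = ranked[: max(1, int(len(ranked) * frac))]
    top_rate = sum(y for _, y in top) / len(top)
    base_rate = sum(y for _, y in preds) / len(preds)
    return top_rate / base_rate

preds = [(0.9, 1), (0.8, 1), (0.7, 1), (0.6, 0), (0.2, 0), (0.1, 0)]
print(lift_at(preds, 0.5))  # 2.0: the top half is twice as dense in owned games
```

The gains chart plots the cumulative version of this same quantity across all depths.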

3.2.4 Optimal Cutpoint and Confusion Matrix

While we are probably more interested in the lift provided by the models to evaluate their efficacy, we can also explore the optimal cutpoint if we wanted to define a hard threshold for identifying games a user will own vs not own.

The threshold we select depends on how much we care about false positives (games the model predicts the user owns but that they do not) vs false negatives (games the user owns that the model does not predict). We can toggle the threshold to trade off these two types of errors.
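One common way to pick a cutpoint (among several criteria) is to maximize Youden’s J, the sum of sensitivity and specificity minus one. A sketch on hypothetical predictions; the notebook’s actual cutpoint criterion may differ:

```python
def confusion(preds, threshold):
    """Count true/false positives and negatives at a hard threshold."""
    tp = sum(p >= threshold and y == 1 for p, y in preds)
    fp = sum(p >= threshold and y == 0 for p, y in preds)
    fn = sum(p < threshold and y == 1 for p, y in preds)
    tn = sum(p < threshold and y == 0 for p, y in preds)
    return tp, fp, fn, tn

def youden_j(preds, threshold):
    tp, fp, fn, tn = confusion(preds, threshold)
    sensitivity = tp / (tp + fn)   # owned games correctly flagged
    specificity = tn / (tn + fp)   # non-owned games correctly rejected
    return sensitivity + specificity - 1

preds = [(0.9, 1), (0.8, 1), (0.6, 0), (0.7, 1), (0.2, 0), (0.1, 0)]
best = max((p for p, _ in preds), key=lambda t: youden_j(preds, t))
print(best, confusion(preds, best))
```

Raising the threshold trades false positives for false negatives; lowering it does the reverse.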

3.2.5 Calibration

Finally, we can understand the performance of the model by examining its calibration. If the model assigns a probability of 5%, how often does the outcome actually occur? A well calibrated model is one in which the predicted probabilities reflect the probabilities we would observe in the actual data. We can assess the calibration of a model by grouping its predictions into bins and assessing how often we observe the outcome versus how often our model expects to observe the outcome.

A model that is well calibrated will closely follow the dashed line - its expected probabilities match the observed probabilities. A model that consistently underestimates the probability of the event will sit above the dashed line, while a model that overestimates the probability will sit below it.
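The binning procedure described above can be sketched directly, with hypothetical predictions and equal-width bins (the notebook’s actual bin count may differ):

```python
def calibration_bins(preds, n_bins=2):
    """Group predictions into equal-width probability bins and compare
    the mean predicted probability with the observed ownership rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in preds:
        i = min(int(p * n_bins), n_bins - 1)
        bins[i].append((p, y))
    out = []
    for b in bins:
        if b:
            expected = sum(p for p, _ in b) / len(b)   # mean prediction
            observed = sum(y for _, y in b) / len(b)   # actual rate
            out.append((round(expected, 2), round(observed, 2)))
    return out

preds = [(0.1, 0), (0.2, 0), (0.3, 1), (0.7, 1), (0.8, 1), (0.9, 0)]
print(calibration_bins(preds))
```

Plotting the expected value against the observed value for each bin, with the dashed identity line as reference, gives the calibration chart.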

3.3 Most and Least Likely Games

What games does the model think GOBbluth89 is most likely to own that are not in their collection?

What games does the model think GOBbluth89 is least likely to own that are in their collection?

3.4 Top Games by Year

Top 25 games most likely to be owned by the user in each year, highlighting in blue the games that the user has owned.

3.5 Interactive Table

This is an interactive table for the model’s predictions for the training set (from resampling).

4 Validating the Model

We’ll validate the model by looking at its predictions for games published in 2020. That is, how well did a model trained on games published before 2020 perform in predicting games for the user in 2020?

4.1 Model Assessment

4.2 Top Games from Validation Set

Table of top 50 games from 2020, highlighting games that the user owns.

5 Predicting Upcoming Games

We can then refit our model to the training and validation set in order to predict all upcoming games for the user.

5.1 Top Upcoming Games

Examine the top 100 upcoming games, highlighting in blue ones the user already owns.

5.2 Interactive Table of Upcoming Games